System, method and computer program product for determining sizes and/or 3D locations of objects imaged by a single camera

ABSTRACT

A system, method or computer program product for estimating an absolute 3D location of at least one object x imaged by a single camera, the system including processing circuitry configured for identifying an interaction, at time t, of object x with an object y imaged with said object x by said single camera, typically including logic for determining object y's absolute 3D location at time t, and providing an output indication of object x's absolute location at time t, derived from the 3D location, as known, at time t, of object y.

FIELD OF THIS DISCLOSURE

The present invention relates generally to image processing, and more particularly to processing images generated by a single camera.

BACKGROUND FOR THIS DISCLOSURE

A method for determining the extent of an object in one direction is described e.g. in the following link:

https://www.pyimagesearch.com/2015/01/19/find-distance-camera-objectmarker-using-python-opencv/

Tennis analytic methods are described e.g. in the following link:

https://github.com/vishaltiwari/bmvc-tennis-analytics.

Methods for fusing sensor data with visual data to estimate orientation of a ground plane are described in:

https://ap.isr.uc.pt/wp-content/uploads/mrl_data/srchive/124.pdf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5676736/

Other technologies useful in conjunction with certain embodiments provided herein are described in:

http://www.mva-org.jp/Proceedings/2013USB/papers/04-22.pdf.

Other known technologies include:

-   [1] Li, Sijin, and Antoni B. Chan. "3D human pose estimation from monocular images with deep convolutional neural network." Asian Conference on Computer Vision. Springer, Cham, 2014.
-   [2] Tekin, Bugra, et al. "Structured prediction of 3D human pose with deep neural networks." arXiv preprint arXiv:1605.05180 (2016).
-   [3] Mehta, Dushyant, et al. "Vnect: Real-time 3D human pose estimation with a single rgb camera." ACM Transactions on Graphics (TOG) 36.4 (2017): 44.
-   [4] Lin, Mude, et al. "Recurrent 3D pose sequence machines." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
-   [5] Tome, Denis, Chris Russell, and Lourdes Agapito. "Lifting from the deep: Convolutional 3D pose estimation from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
-   [6] Zhou, Xingyi, et al. "Towards 3D human pose estimation in the wild: a weakly-supervised approach." Proceedings of the IEEE International Conference on Computer Vision. 2017.
-   [7] Sun, Xiao, et al. "Integral human pose regression." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
-   [8] Martinez, Julieta, et al. "A simple yet effective baseline for 3d human pose estimation." Proceedings of the IEEE International Conference on Computer Vision. 2017.
-   [9] Pavllo, Dario, et al. "3D human pose estimation in video with temporal convolutions and semi-supervised training." arXiv preprint arXiv:1811.11742 (2018).
-   [10] Zhang, Z. "A Flexible New Technique for Camera Calibration." IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.
-   [11] Nath, Tanmay, et al. "Using DeepLabCut for 3D markerless pose estimation across species and behaviors." Nature Protocols (2019).
-   [12] Doosti, Bardia. "Hand Pose Estimation: A Survey." arXiv preprint arXiv:1903.01013 (2019).
-   [13] Malik, Jameel, Ahmed Elhayek, and Didier Stricker. "Simultaneous hand pose and skeleton bone-lengths estimation from a single depth image." 2017 International Conference on 3D Vision (3DV). IEEE, 2017.
-   [14] Li, Ruotong, et al. "Constraint-Based Optimized Human Skeleton Extraction from Single-Depth Camera." Sensors 19.11 (2019): 2604.
-   [15] Nägeli, Tobias, et al. "Flycon: real-time environment-independent multi-view human pose estimation with aerial vehicles." SIGGRAPH Asia 2018 Technical Papers. ACM, 2018.
-   [16] Arnab, Anurag, Carl Doersch, and Andrew Zisserman. "Exploiting temporal context for 3D human pose estimation in the wild." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

US20110135157A1 is a patent document in which a person's relative location is estimated by assuming an average person size.

US20080152191A1 is a patent document which uses stereo cameras to estimate a person's absolute location. EP2383696A1 is a patent document which describes a 3D pose estimator which relies on multiple cameras. Methods that explicitly resolve person size constraints, using multiple or stereo cameras, are known also from the above references 13-15.

The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference, other than subject matter disclaimers or disavowals. If the incorporated material is inconsistent with the express disclosure herein, the interpretation is that the express disclosure herein describes certain embodiments, whereas the incorporated material describes other embodiments. Definition/s within the incorporated material may be regarded as one possible definition for the terms in question.

SUMMARY OF CERTAIN EMBODIMENTS

Certain embodiments of the present invention seek to provide circuitry typically comprising at least one hardware processor in communication with at least one memory, with instructions stored in such memory executed by the processor to provide functionalities which are described herein in detail. Any functionality described herein may be firmware-implemented, or processor-implemented, as appropriate. Certain embodiments seek to determine (a set of) typically absolute 3D locations of (respective joints of) an object (or more) imaged by only a single camera. Since only one camera is available, the actual sizes of some, many or most imaged objects, e.g. persons, are generally not known, since the imaged size is a variable not directly related to the actual size variable and is instead related to actual size via another variable, namely the distance of the object from the camera, or depth. However, the actual size of certain objects may be known, e.g., just by way of example, a floor on which a person is standing or a ball which a person handles (catches, holds or otherwise interacts with), or a manufactured object whose size is known, such as a recognizable vehicle or toy or commodity or appliance or tool. Or, the actual location of certain objects may be known, e.g. the location of a given floor or other ground plane; it is appreciated that any uses of an object's known size described herein, e.g. to derive sizes and/or locations of other objects, may if desired be replaced, mutatis mutandis, by uses of an object's known location, or vice versa.

Certain embodiments seek to determine a (typically absolute) 3D location of an object x by identifying an interaction between that object and another object y whose absolute 3D location is known. If this interaction occurs at time t, then at time t, the 3D location of object x is the known 3D location of object y. For example, if a person is known to be standing on (interacting with) the floor at time t, and the floor's typically absolute location and/or size is known, the size, and/or absolute location at time t, of the person may be determined.

Typically a person's absolute location is provided as a "pose", e.g. the 3D absolute location of each (or some) of the person's joints.

The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or to include in their respective scopes, the following:

Articulated Objects vs. Secondary (aka Non-Articulated) Objects:

These may be e.g. a person and a ball, respectively. The secondary object is used to localize or size-estimate the first object.

Absolute (aka True) vs. Relative 3D Locations:

Relative 3D locations, or 3D locations with unknown scale factor, are typically unitless and typically comprise coordinates with unknown scaling; typically, these coordinates are relative to an average person size. Relative 3D locations are indicative of direction; e.g. they may indicate that a person is pointing towards a camera rather than away, but are not indicative of, say, how distant the person's hand is from the camera. Relative 3D joint locations are helpful for action recognition, because movement is roughly invariant over person size. In contrast, absolute 3D locations yield information on what a person is doing, as well as his location.

Relative coordinates P of a point set with ground-truth coordinates Q typically means that, for every i, j, the angle between p_i p_j and q_i q_j equals zero. Thus, the predicted angles are all correct.

A method that predicts absolute coordinates P ensures that the proper scaling is found, therefore p_i=q_i.

Absolute 3D locations are real or physical 3D locations typically provided relative to a real-world coordinate system, e.g. in meters. The absolute locations of an articulated object typically comprise a set of 3D vectors (x, y, z). This set typically represents the 3D location of all or at least some of the joints of the articulated object.

2D locations: within a frame; typically indicated in pixels.

Pose: This includes the pose e.g. of an articulated object, e.g. a person. This typically includes the absolute 3D locations of all articulated objects, moving parts, or of all of a person's (a human body's) components, e.g. joints and/or limb portions. For example, the pose simply is 3D coordinates, e.g. in meters, for each joint in the body, e.g. xyz coordinates, e.g. in meters, of a tracked person's neck, and of her or his right and left ankle, knee, wrist and elbow joints.

Predict = determine = compute = estimate: These are used generally interchangeably herein.

Limbs: e.g. an arm or a leg, each of which has upper and lower portions, as opposed to joints.

Joints: between typically rigid members, e.g. elbows and knees, which are positioned between upper and lower portions of an arm or leg respectively.

Person size: this may include a collection of plural lengths along the kinematic chain representing the person.

Interaction: when two objects are at a single absolute location, e.g. because one object is touching or handling the other (two people dancing with one another, a person handling a ball of a known size, a person's feet standing on the ground or floor, for example).

Kinematic length: distance or length between two joints that are physically connected.

"Action recognition framework": software operative to detect the moment or point in time at which a first object, e.g. person, is touching a second object, e.g. non-articulated object, e.g. with known size. The input to the software typically comprises a sequence of images, and the output typically comprises locations in time (e.g. frame number) and/or location (typically in pixels) within the frame at which the person (say) has touched the ball (say).

Kinematic pairs: A tree may include plural "kinematic pairs". Each such pair typically comprises a computational model of a connection, or joint (e.g. elbow), between two connected links (e.g. the upper and lower arm connected at the elbow). The computational model provided by a "kinematic pair" reflects the nature of the joint between a pair of connected links, e.g. whether the joint is hinged, sliding, etc.

It is appreciated that any reference herein to, or recitation of, an operation being performed is, e.g. if the operation is performed at least partly in software, intended to include both an embodiment where the operation is performed in its entirety by a server A, and also to include any type of "outsourcing" or "cloud" embodiments in which the operation, or portions thereof, is or are performed by a remote processor P (or several such), which may be deployed off-shore or "on a cloud", and an output of the operation is then communicated to, e.g. over a suitable computer network, and used by server A. Analogously, the remote processor P may not, itself, perform all of the operations, and, instead, the remote processor P itself may receive output/s of portion/s of the operation from yet another processor/s P′, which may be deployed off-shore relative to P, or "on a cloud", and so forth.

There is thus provided, in accordance with at least one embodiment of the present invention, a system, method and computer program product as described herein.

The present invention typically includes at least the following embodiments:

Embodiment 1. A system (or method or computer program product) for estimating an absolute 3D location of at least one object x imaged by a single camera, the system including processing circuitry configured for identifying an interaction, at time t, of object x with an object y typically imaged with the object x typically by the single camera, typically including logic for determining object y's absolute 3D location at time t, and/or providing an output indication of object x's absolute location at time t, typically derived from the 3D location, as known, at time t, of object y.

Embodiment 2. A system according to any of the preceding embodiments wherein the object x comprises a first joint interconnected via a limb of fixed length to a second joint, and wherein the system determines the fixed length and then determines the second joint's absolute 3D location at time t to be a location L whose distance from object x's absolute location at time t is the fixed length.

Embodiment 3. A method for providing an output indication of absolute 3D locations of objects imaged by a single camera, the method including using a hardware processor for performing, at least once, all or any subset of the following operations:

-   a. providing relative 3D joint locations for an object x having joints J whose absolute locations are of interest; and/or
-   b. providing absolute 3D locations of at least one object y, which is imaged by the single camera and whose size is known; and/or
-   c. re-identifying objects x, y over time; and/or
-   d. recognizing interaction of object x, as re-identified in operation c, with the object y of known size, as re-identified in operation c; and/or
-   e. for each interaction recognized in operation d,
    -   finding the absolute 3D location L of object y with known size, at the time T at which the interaction occurred; and/or
    -   setting the absolute 3D location of joint J to be L; and/or
    -   using relative 3D locations, as provided in operation a, of at least one joint k other than J relative to the 3D location of joint J, to compute absolute 3D locations of joint k, given the absolute location L of J as found; and/or
    -   determining distances between absolute 3D locations of various joints k, to yield an object size parameter comprising an estimated length of a limb portion extending between joints J and k; and/or
-   f. computing a best-estimate from the object sizes estimated in operation e, for each tracked limb portion; and/or
-   g. using the best-estimate to scale relative 3D joint locations into absolute 3D joint locations (an illustrative sketch of operations f-g follows this list); and/or
-   h. generating an output indication of the absolute 3D joint locations.
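By way of non-limiting illustration only, the following Python sketch shows one possible reading of operations f-g above (and of the outlier removal of Embodiment 7 below), under the simplifying assumption that the relative joint locations differ from the absolute ones by a single unknown scale factor; all function and parameter names are hypothetical and are not part of the claimed subject matter.

```python
import numpy as np

def robust_limb_length(estimates_m, outlier_factor=3.0):
    """Operation f: best-estimate of one limb portion's length from plural
    per-interaction estimates (in metres), removing outliers which differ from
    the cluster of estimates more than the estimates differ among themselves."""
    est = np.asarray(estimates_m, dtype=float)
    med = np.median(est)
    spread = np.median(np.abs(est - med)) + 1e-9   # robust spread of the cluster
    kept = est[np.abs(est - med) <= outlier_factor * spread]
    return float(np.mean(kept)) if kept.size else float(med)

def scale_relative_to_absolute(relative_joints, j, k, best_length_m):
    """Operation g: scale relative (unitless) joint locations into absolute
    (metric) joint locations, using the best-estimate length of the limb
    portion extending between joints j and k."""
    rel = np.asarray(relative_joints, dtype=float)        # shape (num_joints, 3)
    rel_length = np.linalg.norm(rel[j] - rel[k]) + 1e-9   # same limb, relative units
    return rel * (best_length_m / rel_length)
```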

Embodiment 4. A method according to any of the preceding embodiments wherein the object y is a stationary object having a permanent absolute 3D location which is stored in memory.

Embodiment 5. A method according to any of the preceding embodiments wherein the object y comprises a ball and wherein data is pre-stored regarding the ball's known conventional size.

Embodiment 6. A method according to any of the preceding embodiments wherein in operation c, the re-identifying objects comprises tracking the objects.

Embodiment 7. A method according to any of the preceding embodiments wherein in operation f, the computing of a best-estimate includes removing outliers which differ from a cluster of estimates to a greater extent, relative to the extent to which the estimates in the cluster differ among themselves.

Embodiment 8. A method according to any of the preceding embodiments wherein the method is performed in near-real time.

Embodiment 9. A method according to any of the preceding embodiments wherein in operation h, the output indication serves as an input to a processor which identifies interesting moments in the single camera's output and alerts end-users of the interesting moments.

Embodiment 10. A method according to any of the preceding embodiments wherein in operation h, the output indication serves as an input to a processor which controls insertions of virtual advertisements.

Embodiment 11. A method according to any of the preceding embodiments wherein the object y comprises an object O whose absolute 3D joint locations are known from operation h, by virtue of having previously derived absolute 3D locations of object O's joints by performing the operations a-h.

Embodiment 12. A method according to any of the preceding embodiments wherein the output indication may be sensed by a human user.

Embodiment 13. A system according to any of the preceding embodiments wherein the determining object y's absolute 3D location comprises retrieving object y's absolute 3D location from memory, because object y's absolute 3D location is known to the system.

Each joint may comprise a joint from among the following group: shoulder, elbow, wrist, knee, ankle.

Embodiment 14. A system according to any of the preceding embodiments wherein the processing circuitry is also configured for: generating plural estimates of object x's size for plural interactions respectively; and/or finding a best estimate from among the estimates, over time, and/or using the best estimate to estimate the absolute 3D location, thereby to handle situations in which the object-joint distance is not exactly 0.

Embodiment 15. A system according to any of the preceding embodiments wherein the object x's absolute location at time t is estimated as being equal to the 3D location, as known, at time t, of object y.

Embodiment 16. A system according to any of the preceding embodiments wherein the object y's size is known and wherein the logic uses knowledge about the object y's size to derive the object x's absolute location at time t.

Embodiment 17. A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, the computer readable program code adapted to be executed to implement a method, the method comprising any method herein.

Embodiment 18. A system according to any of the preceding embodiments wherein at least one fixed size, known to remain fixed between frames, which characterizes object x, is used for converting object x's relative 3D joint locations to absolute 3D joint locations for object x.

Embodiment 19. A system according to any of the preceding embodiments wherein the fixed size comprises an actual distance between 2 connected joints belonging to object x.

Embodiment 20. A system according to any of the preceding embodiments wherein the converting is performed for at least one frame in which object x is not interacting with object y.

Also provided, excluding signals, is a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on at least one computer; and a computer program product, comprising a typically non-transitory computer-usable or -readable medium, e.g. non-transitory computer-usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. The operations in accordance with the teachings herein may be performed by at least one computer specially constructed for the desired purposes, or a general purpose computer specially configured for the desired purpose by at least one computer program stored in a typically non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.

Any suitable processor/s, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor/s, display and input means including computer programs, in accordance with all or any subset of the embodiments of the present invention. Any or all functionalities of the invention shown and described herein, such as but not limited to operations within flowcharts, may be performed by any one or more of: at least one conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as flash drives, optical disks, CD-ROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing; and keyboard or mouse for accepting. Modules illustrated and described herein may include any one or combination or plurality of: a server, a data processor, a memory/computer storage, a communication interface (wireless (e.g. BLE) or wired (e.g. USB)), a computer program stored in memory/computer storage.

The term "process" as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and/or memories of at least one computer or processor. Use of nouns in singular form is not intended to be limiting; thus the term processor is intended to include a plurality of processing units which may be distributed or remote, the term server is intended to include plural typically interconnected modules running on plural respective servers, and so forth.

The above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.

The apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements all or any subset of the apparatus, methods, features and functionalities of the invention shown and described herein. Alternatively or in addition, the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program, such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may, wherever suitable, operate on signals representative of physical objects or substances.

The embodiments referred to above, and other embodiments, are described in detail in the next section.

Any trademark occurring in the text or drawings is the property of its owner and occurs herein merely to explain or illustrate one example of how an embodiment of the invention may be implemented.

Unless stated otherwise, terms such as "processing", "computing", "estimating", "selecting", "ranking", "grading", "calculating", "determining", "generating", "reassessing", "classifying", "producing", "stereo-matching", "registering", "detecting", "associating", "superimposing", "obtaining", "providing", "accessing", "setting" or the like, refer to the action and/or processes of at least one computer/s or computing system/s, or processor/s or similar electronic computing device/s or circuitry, that manipulate and/or transform data which may be represented as physical, such as electronic, quantities, e.g. within the computing system's registers and/or memories, and/or may be provided on-the-fly, into other data which may be similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices, or may be provided to external factors e.g. via a suitable data network. The term "computer" should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, embedded cores, computing systems, communication devices, processors (e.g. digital signal processors (DSP), microcontrollers, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc.) and other electronic computing devices. Any reference to a computer, controller or processor is intended to include one or more hardware devices, e.g. chips, which may be co-located or remote from one another. Any controller or processor may for example comprise at least one CPU, DSP, FPGA or ASIC, suitably configured in accordance with the logic and functionalities described herein.

Any feature or logic or functionality described herein may be implemented by processor/s or controller/s configured as per the described feature or logic or functionality, even if the processor/s or controller/s are not specifically illustrated for simplicity. The controller or processor may be implemented in hardware, e.g., using one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs), or may comprise a microprocessor that runs suitable software, or a combination of hardware and software elements.

The present invention may be described, merely for clarity, in terms of terminology specific to, or references to, particular programming languages, operating systems, browsers, system versions, individual products, protocols and the like. It will be appreciated that this terminology or such reference/s is intended to convey general principles of operation clearly and briefly, by way of example, and is not intended to limit the scope of the invention solely to a particular programming language, operating system, browser, system version, or individual product or protocol. Nonetheless, the disclosure of the standard or other professional literature defining the programming language, operating system, browser, system version, or individual product or protocol in question, is incorporated by reference herein in its entirety.

Elements separately listed herein need not be distinct components and alternatively may be the same structure. A statement that an element or feature may exist is intended to include (a) embodiments in which the element or feature exists; (b) embodiments in which the element or feature does not exist; and (c) embodiments in which the element or feature exists selectably, e.g. a user may configure or select whether the element or feature does or does not exist.

Any suitable input device, such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein. Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein. Any suitable processor/s may be employed to compute or generate information as described herein and/or to perform functionalities described herein and/or to implement any engine, interface or other system illustrated or described herein. Any suitable computerized data storage, e.g. computer memory, may be used to store information received by or generated by the systems shown and described herein. Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.

The system shown and described herein may include user interface/s, e.g. as described herein, which may for example include all or any subset of: an interactive voice response interface, automated response tool, speech-to-text transcription system, automated digital or electronic interface having interactive visual components, web portal, visual interface loaded as web page/s or screen/s from server/s via communication network/s to a web browser or other application downloaded onto a user's device, automated speech-to-text conversion tool, including a front-end interface portion thereof and back-end logic interacting therewith. Thus the term user interface or "UI" as used herein includes also the underlying logic which controls the data presented to the user, e.g. by the system display, and receives and processes and/or provides to other modules herein, data entered by a user, e.g. using her or his workstation/device.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain embodiments of the present invention are illustrated in the following drawings; in the block diagrams, arrows between modules may be implemented as APIs, and any suitable technology may be used for interconnecting functional components or modules illustrated herein in a suitable sequence or order, e.g. via a suitable API/Interface. For example, state of the art tools may be employed, such as but not limited to Apache Thrift and Avro which provide remote call support. Or, a regular communication protocol may be employed, such as but not limited to HTTP or MQTT, and may be combined with a standard data format, such as but not limited to JSON or XML.

Methods and systems included in the scope of the present invention may include any subset or all of the functional blocks shown in the specifically illustrated implementations by way of example, in any suitable order, e.g. as shown. Flows may include all or any subset of the illustrated operations, suitably ordered, e.g. as shown. Tables herein may include all or any subset of the fields and/or records and/or cells and/or rows and/or columns described.

Example embodiments are illustrated in the various drawings. Specifically:

FIG. 1 is a simplified flowchart illustration of a process provided in accordance with embodiments herein.

FIG. 2 is a table useful for understanding embodiments herein.

FIGS. 3 and 4 are pictorial illustrations useful in understanding certain embodiments.

Computational, functional or logical components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof. A specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave as described herein with reference to the functional component in question. For example, the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs, and may originate from several computer files which typically operate synergistically.

Each functionality or method herein may be implemented in software (e.g. for execution on suitable processing hardware such as a microprocessor or digital signal processor), firmware, hardware (using any conventional hardware technology such as integrated circuit technology), or any combination thereof.

Functionality or operations stipulated as being software-implemented may alternatively be wholly or fully implemented by an equivalent hardware or firmware module, and vice-versa. Firmware implementing functionality described herein, if provided, may be held in any suitable memory device, and a suitable processing unit (aka processor) may be configured for executing firmware code. Alternatively, certain embodiments described herein may be implemented partly or exclusively in hardware, in which case all or any subset of the variables, parameters, and computations described herein may be in hardware.

Any module or functionality described herein may comprise a suitably configured hardware component or circuitry. Alternatively or in addition, modules or functionality described herein may be performed by a general purpose computer, or more generally by a suitable microprocessor, configured in accordance with methods shown and described herein, or any suitable subset, in any suitable order, of the operations included in such methods, or in accordance with methods known in the art.

Any logical functionality described herein may be implemented as a real-time application, if and as appropriate, and may employ any suitable architectural option, such as but not limited to FPGA, ASIC or DSP, or any suitable combination thereof.

Any hardware component mentioned herein may in fact include either one or more hardware devices, e.g. chips, which may be co-located or remote from one another.

Any method described herein is intended to include within the scope of the embodiments of the present invention also any software or computer program performing all or any subset of the method's operations, including a mobile application, platform or operating system, e.g. as stored in a medium, as well as combining the computer program with a hardware device to perform all or any subset of the operations of the method.

Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes, or different storage devices at a single node or location.

It is appreciated that any computer data storage technology, including any type of storage or memory and any type of computer components and recording media that retain digital data used for computing for an interval of time, and any type of information retention technology, may be used to store the various data provided and employed herein. Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS

Each element, e.g. operation, described herein may have all characteristics described or illustrated herein, or, according to other embodiments, may have any subset of the characteristics described herein.

The terms processor or controller or module or logic as used herein are intended to include hardware such as computer microprocessors or hardware processors, which typically have digital memory and processing capacity, such as those available from, say, Intel and Advanced Micro Devices (AMD). Any operation or functionality or computation or logic described herein may be implemented entirely or in any part on any suitable circuitry, including any such computer microprocessor/s, as well as in firmware or in hardware or any combination thereof.

It is appreciated that elements illustrated in more than one drawing, and/or elements in the written description, may still be combined into a single embodiment, except if otherwise specifically clarified herewithin. It is appreciated that any features, properties, logic, modules, blocks, operations or functionalities described herein which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment, except where the specification or general knowledge specifically indicates that certain teachings are mutually contradictory and cannot be combined. Any of the systems shown and described herein may be used to implement, or may be combined with, any of the operations or methods shown and described herein.

FIG. 1 describes a process including operations for determining absolute 3D locations of objects of interest. The process of FIG. 1 includes all or any subset of the following operations, suitably ordered, e.g. as follows; all operations and methods described herein may be performed for example by a suitable hardware processor.

Operation 5

Receive at least one image of articulated objects (e.g. a single image which depicts contact between articulated and non-articulated objects). Typically the input comprises a video stream which typically depicts contact between articulated and non-articulated objects, typically without further input.

Each articulated object typically includes plural rigid object-portions or links, e.g. human limbs, which are interconnected by joints, where each joint interconnecting object portions a, b allows rotational and/or translational relative motion between a and b, the relative motion of each pair of object-portions, if translational, typically having 1 or 2 or 3 degrees of freedom along 1 or 2 or 3 perpendicular axes respectively.
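For illustration only, an articulated object may be represented in software, e.g. as a set of "kinematic pairs" as defined above; the following minimal Python sketch (all names hypothetical, not part of any particular library) is one such representation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KinematicPair:
    """A joint (e.g. an elbow) between two connected links (e.g. upper and lower arm)."""
    parent_link: str
    child_link: str
    joint_type: str   # e.g. "hinge", "ball", "slide"
    dof: int = 1      # 1, 2 or 3 degrees of freedom, as described above

@dataclass
class ArticulatedObject:
    """An articulated object (e.g. a person) as a collection of kinematic pairs."""
    name: str
    pairs: List[KinematicPair] = field(default_factory=list)

# Example: a minimal arm with a hinged elbow joint.
arm = ArticulatedObject("person", [KinematicPair("upper_arm", "lower_arm", "hinge", dof=1)])
```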

Operation 7:

Operation 7 may be performed, in which the 2D pose is estimated, using a suitable algorithm such as OpenPose, for example https://github.com/CMU-Perceptual-Computing-Lab/openpose, which describes a 2D pose estimator.

Operation 10

Derive estimations of relative 3D joint locations from the (at least one) image. For example, if the image represents a scene including plural persons, relative 3D joint locations may be estimated for each of the plural persons in the scene. Here, Method A will be used.

Method A computes relative 3D locations, which are ambiguous in scale and position (e.g. a human body with unknown distances between the body's joints, along limb portions of unknown lengths); hence Method A cannot differentiate an image of a large person from an image of a smaller person who is closer to the camera.

The output of Method A is relative 3D joint locations, with ambiguous person size.

Method A typically comprises a 3D pose estimation algorithm using no prior knowledge about person size, e.g. the 3D pose estimator described in https://arxiv.org/pdf/1705.03098.pdf, which uses the 2D pose estimated in operation 7 as an input.
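The scale ambiguity of Method A can be illustrated numerically: under a pinhole camera model, a person twice as large placed twice as far from the camera projects to exactly the same pixels. The following sketch is illustrative only; the intrinsics and joint coordinates are made up.

```python
import numpy as np

f_x, f_y, c_x, c_y = 1000.0, 1000.0, 640.0, 360.0   # hypothetical pinhole intrinsics (pixels)

def project(points_3d):
    """Project camera-frame 3D points (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f_x * X / Z + c_x, f_y * Y / Z + c_y], axis=1)

joints = np.array([[0.0, -0.8, 4.0],    # e.g. an ankle, 4 m from the camera
                   [0.0,  0.6, 4.0]])   # e.g. a neck
scaled = 2.0 * joints                    # a person twice as big, twice as far away

assert np.allclose(project(joints), project(scaled))   # identical pixels: size is ambiguous
```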

Operation 20

Use object tracking or re-identification to track or recognize objects over time. For example, object/s, e.g. all joints in a scene, and/or at least one secondary or non-articulated object such as a ball, may be fully tracked over time. And/or, any other reconstructed absolute 3D locations may be tracked over time, e.g. using simultaneous localization and mapping (SLAM), which is a known technology for constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's absolute location within that environment, using any suitable method such as particle filter, extended Kalman filter, covariance intersection, and GraphSLAM.

Any suitable method may be employed to initially recognize each articulated object, such as but not limited to pose estimation algorithms such as OpenPose (e.g. https://github.com/cmu-perceptual-computing-lab/openpose).

Any suitable method may be used for tracking, e.g. any suitable video tracking technique such as but not limited to kernel-based tracking or contour tracking.

The tracking technology described in the following publication may also be used: https://arxiv.org/pdf/1606.09549.pdf.

If bounding box tracking functionality (aka a bounding box tracker) is used, the tracked bounding box is typically matched against a previously detected bounding box. Typically, a track is matched to a (or each) current detection, including assigning each detection to a track or assigning that track's ID to that detection.

A bounding box may be computed for articulated objects by receiving a set of 2D pixel locations of an object, determining the object's minimal and maximal x, y coordinates from among the 2D locations in the set, and using these minimal and maximal coordinates as inputs to the bounding box tracker. The tracker may, for example, use the bounding boxes of frame t and the image at t+1 as inputs to a function which predicts bounding boxes in frame t+1.
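For instance, the bounding box of an articulated object may be derived from its 2D joint locations roughly as in the following sketch (illustrative only; names are hypothetical):

```python
import numpy as np

def bounding_box(joint_pixels):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) computed from the
    minimal and maximal x, y coordinates of an object's 2D joint locations."""
    pts = np.asarray(joint_pixels, dtype=float)   # shape (num_joints, 2), in pixels
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)

# The resulting box may then be fed, together with the image at t+1, to whichever
# bounding box tracker is used, to predict the box in frame t+1.
```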

Operation 30

Operation 30 typically employs Method B, aka requirement ii, which accurately predicts absolute 3D locations of the secondary objects, meaning these coordinates are not ambiguous in scale and position (e.g. a ball whose actual size in cm is known).

Method B typically performs accurate 3D object detection, and may use prior knowledge about object size.

For example, a CNN (Convolutional Neural Network) may be configured for receiving an image and predicting actions and their localization. For example, a method for detecting shots of balls in soccer broadcasting videos is described in S. Jackman, "Football Shot Detection using Convolutional Neural Networks", available online at https://www.diva-portal.org/smash/get/diva2:1323791/FULLTEXT01.pdf, aka [Jackman2019].

Method B may be used to predict absolute 3D locations of secondary, e.g. non-articulated, objects. Method B is typically operative to predict absolute 3D locations of secondary objects (e.g. a ball), yielding 3D coordinates for the secondary object, whose scale and position are unambiguous.

There exist methods which accurately predict absolute 3D locations for objects; such methods are described e.g. in:

-   https://www.pyimagesearch.com/2015/01/19/find-distance-camera-objectmarker-using-python-opencv/; and/or
-   https://github.com/vishaltiwari/bmvc-tennis-analytics

For example, the absolute 3D location for 2 example objects (ground plane, soccer ball) may be computed as follows:

-   i. An AR (augmented reality) toolkit may be used to estimate the absolute 3D location of the ground-plane, e.g. as described in https://blog.markdaws.net/arkit-by-example-part-2-plane-detection-visualization-10f05876d53.
-   ii. The absolute 3D location of a soccer ball may be estimated based on prior knowledge of conventional/regular ball sizes. Methods which compare imaged size to real-life size and derive absolute 3D location (see also the illustrative sketch after the links below) are described e.g. in

https://www.pyimagesearch.com/2015/01/19/find-distance-camera-objectmarker-using-python-opencv/; and/or

https://github.com/vishaltiwari/bmvc-tennis-analytics; and/or

http://www.mva-org.jp/Proceedings/2013USB/papers/04-22.pdf
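A minimal sketch of the second approach, assuming a simple pinhole camera with known intrinsics and a detected ball of known real diameter (all names hypothetical, not taken from any particular library), might look as follows:

```python
def ball_absolute_location(u, v, pixel_diameter, real_diameter_m, f_x, f_y, c_x, c_y):
    """Estimate a ball's absolute 3D location (camera frame, metres) by comparing
    its imaged diameter (pixels) to its known real-world diameter (metres)."""
    Z = f_x * real_diameter_m / pixel_diameter   # depth from the size ratio
    X = (u - c_x) / f_x * Z                      # back-project the ball centre (u, v)
    Y = (v - c_y) / f_y * Z
    return X, Y, Z
```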

Operation 40

Use of an "action recognition framework" to recognize an interaction of a joint of an articulated object, e.g. person, with an object of known size.

The output of operation 40 typically comprises all or any subset of:

a. the number and/or time-stamp of a frame, in a video sequence, in which a person's hand (say) touches a secondary object of interest, e.g. a ball; and/or

b. the x, y location (e.g. pixel) within this frame, at which the hand and ball are located.

c. identification of the body limb portion and secondary object which interacted in this frame. For example, in the above example, body limb portion=hand; secondary object=ball.

The object of known size may for example be:

-   i. Ground-plane on which a person's foot is placed.

The size of the ground-plane may be known (or determined, either offline or in near real time), and, typically, so may the absolute 3D locations of the boundaries of the ground-plane, using any suitable method. For example, the ground plane's distance from a camera imaging the ground plane, and the ground-plane's orientation relative to the camera, may be estimated accurately, e.g. by an AR toolkit, in at least one frame, typically for each and every frame.

The ray from the camera which intersects a known 2D pixel location of a joint, at the moment of that joint's interaction with an object of known 3D absolute position, e.g. as determined in Operation 40, will intersect the ground plane at an absolute 3D location L, which may serve as an estimate of the absolute foot location.

Example: the 2D location may be a left foot's 2D pixel location at the time when the foot comes into contact with a ball that the foot is about to kick. Or, the 2D location may be the left foot's 2D location at the moment a person comes into contact with (e.g. lands on, perhaps following a jump) a known ground plane; this may be defined using an equation defining the relationship between absolute coordinates x, y, and z of (say) a ground plane. For example, in the Hessian form, x*a+y*b+z*c=d defines the ground plane, with (a, b, c) as a normal vector and d as the distance from the origin to the plane. a, b, c and d are then found, e.g. using a conventional AR toolkit.

Thus, the ground plane may be described analytically, e.g. as an equation which holds for any point (x, y, z) which lies on the ground plane. The ground floor may be assumed to be a plane, and small deviations from the plane due to a surface of grass or coarse asphalt, e.g., may be ignored.

It is appreciated that L is known, because if the ground plane is known, the 3D ray from the focus that intersects the 2D pixel location can be found with known intrinsics, e.g. as described in FIG. 3, which is based on Slide 10 of http://www.cse.psu.edu/˜rtc12/CSE486/lecture12.pdf which gives formulas parameterized in depth Z.

An example illustrating determination of an absolute 3D location of a joint touching a plane with known parameters is now described in detail; it is appreciated that the plane may alternatively be parameterized in different ways, whereas the example assumes the plane is parameterized using the Hessian normal form:

n_x*X + n_y*Y + n_z*Z = −p (equation 6 from http://mathworld.wolfram.com/HessianNormalForm.html)   (I)

The parameters n_x, n_y, n_z and p may be determined by operation 30 herein. The above equation (I) is true if (X, Y, Z) lies on the plane. The relationship between the absolute 3D location (X, Y, Z) of a joint and its projected 2D pixel location (u, v) is given by the following system of equations, aka "camera projection equation/s":

x′=X/Z   (II)

y′=Y/Z   (III)

u=f_x*x′+c_x   (IV)

v=f_y*y′+c_y   (V)

(see the equations in https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html; note that the world coordinates x=X, y=Y and z=Z are typically not relevant to the present context). As the absolute 3D joint location is touching or intersecting the plane, assume the absolute 3D location lies on the plane. Therefore equations I-V can be solved for the 3D absolute location (X, Y, Z). Solving IV and V for x′ and y′:

u=f_x*x′+c_x⇔x′=(u−c_x)/f_x

v=f_y*y′+c_y⇔y′=(v−c_y)/f_y

and inserting the values of x′ and y′ into II and III:

(u−c_x)/f_x=X/Z⇔(u−c_x)/f_x*Z=X   (VI)

(v−c_y)/f_y=Y/Z⇔(v−c_y)/f_y*Z=Y   (VII)

And inserting these into I results in a solution for Z:

n_x*(u−c_x)/f_x*Z + n_y*(v−c_y)/f_y*Z + n_z*Z = −p ⇔ (n_x*(u−c_x)/f_x + n_y*(v−c_y)/f_y + n_z)*Z = −p ⇔ Z = −p/(n_x*(u−c_x)/f_x + n_y*(v−c_y)/f_y + n_z)

The values of X and Y may then be computed by equations VI and VII:

X=(u−c_x)/f_x*Z

Y=(v−c_y)/f_y*Z
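For concreteness, the closed-form solution derived above (equations I-VII) may be coded e.g. as in the following sketch; this is illustrative only, and the function and parameter names are not taken from any library.

```python
def joint_on_plane(u, v, f_x, f_y, c_x, c_y, n_x, n_y, n_z, p):
    """Absolute 3D location (X, Y, Z) of a joint whose 2D pixel location (u, v)
    is known and which lies on the plane n_x*X + n_y*Y + n_z*Z = -p (equation I)."""
    a = (u - c_x) / f_x                 # x' from equation IV, used in VI
    b = (v - c_y) / f_y                 # y' from equation V, used in VII
    Z = -p / (n_x * a + n_y * b + n_z)  # the solution for Z derived above
    return a * Z, b * Z, Z              # X from VI, Y from VII, and Z
```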

-   ii. A ball (e.g. a person who is holding a ball in his hands), in which case "interaction recognition" may be based on 2D pose and 2D ball locations.
-   iii. Another articulated object (e.g. person B dancing with person A) whose kinematic lengths are already known. It is appreciated that kinematic lengths for a given person imaged in a frame may be known (available to the system).

For example, consider a video sequence representing two persons, A and B, playing basketball. A handles or interacts with the ball. This first interaction allows A's absolute location to be determined. Then, A bumps into B. This subsequent interaction allows the absolute location of B to be estimated immediately, e.g. in real- or near-real time.

Here, Method C is used. Method C detects interactions between objects of interest and secondary objects.

Operation 40 typically employs Method C, aka requirement iii, aka the "action recognition framework", which detects interactions of the object with secondary objects (e.g. the person holding the ball in his hands).

Using Method C, any non-articulated object with known size can be localized by detecting image 2D positions and their mapping to the object's coordinate system, e.g. as described in Z. Zhang, "A Flexible New Technique for Camera Calibration", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000, available online at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr98-71.pdf, aka [Zhang2000].

Alternatively, any other camera calibration technique, not necessarily [Zhang2000], which computes an object's (typically absolute) 3D location based on the object's known size, may be used, such as any such technique used in augmented reality applications for fixed objects.
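As one possible, non-limiting realization using a standard computer-vision library, an object whose 3D model points are known in its own coordinate system (i.e. whose size is known) may be localized from its 2D detections with a PnP solver; the helper function below is hypothetical, a sketch rather than the claimed method.

```python
import numpy as np
import cv2

def localize_known_size_object(image_points_px, model_points_m, camera_matrix):
    """Given 2D detections (pixels) of known 3D model points (metres, in the
    object's own coordinate system) and the camera intrinsics, recover the
    object's rotation and translation relative to the camera."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(model_points_m, dtype=np.float64),
        np.asarray(image_points_px, dtype=np.float64),
        np.asarray(camera_matrix, dtype=np.float64),
        None)                                   # no lens distortion assumed
    if not ok:
        raise RuntimeError("PnP solution not found")
    return rvec, tvec   # tvec is the object's absolute 3D location in the camera frame
```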

Alternatively, Method C may comprise any of the embodiments described in italics herein:

Provided is a system for tracking one or multiple users' interaction with the environment and sports tools like balls, tennis rackets, etc.

The system may use the sensors of a mobile phone (e.g., camera, IMU, audio) and auxiliary devices (e.g. an IMU on a smart watch), as well as inter-device communication (e.g. UWB for precise localization), to capture the environment, relevant objects and the spatial configuration, measure activities and interactions, analyze the performance during the exercise, and provide feedback to the user in real-time during and after execution of the exercise. Feedback can be acoustic or visual and can relate to instructions regarding the setup of the system, to execution of the exercise, or to user performance.

The system may include all or any subset of:

-   In addition to components of System for interactive physical exercises using mobile phones—Shot on Goal:
-   Computational Resource Management: Methods to estimate at what frequency and time certain costly operations, like ball and pose detection, have to be run to achieve the best results in action detection and classification, as well as methods to determine volumes of interest (region + scale) in the image on which to focus computations.
-   Uncertainty estimation: Methods to estimate whether a decision by one or multiple components is confident or if there is a lot of underlying uncertainty. This improves handling of difficult cases that are hard to detect and are affected by noise, e.g. slight touches, ground hits with a nearby foot. Estimations with an increased uncertainty could also trigger other, more costly methods of determining the action if needed.
-   Human pose dynamics model: Methods that enable the system to understand the probability of an event detection/classification given a certain pose/movement, e.g. a standing foot cannot be used for kicking the ball, so a person standing on his right foot will most likely be kicking the ball with the left foot.

Exemplary Application: Ball Handling

In this application the disclosed technology is used to implement an exercise in which the user is judged on his skills with handling the ball.

In an optional step, the user scans the environment with the camera of the phone to establish a virtual world coordinate system (see Augmented Reality) and/or the ground plane. As soon as a ball enters the field of view of the smart phone camera, it is recognized and tracked by the system (see Image Analysis and Multi-Target Tracking). If multiple balls are detected in the scene, the system can determine the relevant one (see Determination of relevant objects). The system detects events and states (e.g. the ball bounced, was kicked, is balanced) and classifies events and states (e.g. the ball bounced on the floor, was kicked with the left foot, was balanced on the shoulders). In addition, the system can determine whether the ball is in the air or on the ground.

The exercise could be comprised of freestyle scoring of certain event/state/class combinations, or based on achieving a certain sequence of combinations (e.g. do 10 kicks, alternating between left and right knee).

Exercises can transition automatically without need for the user to interact manually with the device. Voice instructions can be used to alter the exercises, and audio feedback can instruct the user what to do next as well as signal successful and unsuccessful actions.

Each ball interaction can be analyzed according to various performance metrics relevant to the exercise, such as incoming velocity, outgoing velocity, body part used, bounce on environment/ground, complexity of sequences.

Other application examples:

-   Determine which foot was used to shoot a goal
-   Ball control, to determine skillful handling of the ball

Technical Method

Problem: determine detailed information of actions, activities and interactions of/between persons, objects (e.g. a ball) and the environment.

-   Scene contains one or more persons interacting with each other, one or more real physical objects (e.g. a ball) and/or one or more virtual objects placed in the virtual representation of the world.
-   Actions, Activities and Interactions can be described as events and states varying in length.
-   For each action/activity/interaction, determine e.g. which body part interacted with the object, when, where and for how long (classification):
    -   Potential event classes: with which body part the interaction (e.g. bounce of ball on person) occurred (e.g. left foot, right foot, head, left shoulder, left hand, . . . ).
    -   Interaction state classes: e.g. with which body part the state (e.g. holding, balancing) occurred (e.g. left foot, right foot, head, left shoulder, left hand, . . . ).
    -   Action/Activity state classes: e.g. whether the ball is in the air, on the ground, static, moving, undefined, etc.
-   Enriched Events:
    -   Determine the location where the interaction took place (e.g. which part of the foot, where on the object).
    -   Determine the effect of the interaction: e.g. object delta V (i.e. change of motion and motion direction), person balance.
    -   Determine the energy transfer between interacting objects and the environment (e.g. delta V, waste of kinetic energy) to determine efficiency (useful energy vs wasted energy), e.g. the efficiency of a kick: how much energy from the leg was transferred to the ball.
-   Optional: additionally, the problem described in System for interactive physical exercises using mobile phones—Shot on Goal.
-   Optional: Extension to interaction with virtual objects.

Solution:

-   -   Temporally adaptive learned model for detection and/or classification
        -   Event Detection
            -   Event detection models may be trained that have the person(s) 2D and/or 3D location, ball 2D and/or 3D location over time, as well as environmental features (other object locations, room, ground plane, etc.).
            -   The model inputs can be structured on a fixed temporal grid, passed in sequence.
                -   The inputs can contain any available datapoint (e.g. pose) when available. If, due to computational constraints and/or throttling, no data is available for a certain timestamp, empty, interpolated or extrapolated data can be passed.
                -   The fixed temporal grid ensures a fixed inner clock for the model and allows for interpolation of missing input data.
                -   The inputs can be normalized to account for different spatial configurations relative to the camera.
                -   Positional annealing methods such as exponential smoothing can be used to ensure the input data may deviate but stays close to a certain range.
            -   The model is specifically trained to adapt to varying data input frequencies. The frequencies can vary from device to device, over time on the same device, per component (e.g. ball vs. pose detection) and per sensor on the device (e.g. 30 Hz video, 50 Hz accelerometer).
            -   For each input the model returns a value indicating whether an event at this timestep is likely.
                -   The model's outputs can be delayed by a fixed or varying amount to gather information about the future, to improve robustness of the detection.
                -   Thresholding and/or peak detection methods can be used to filter events.
            -   The event can be further enriched with information on object trajectories using its timestamp.
        -   State Detection
            -   The state of the object (e.g. ball) relative to a person can be e.g. "balancing", "no contact", "holding", etc.
            -   Using the relative position of person, body parts, real and virtual objects over time, the state is determined.
            -   The model inputs and processing can be structured according to the Event Detection learned model from above.
            -   For each state class the model can have a dependent or independent output.
                -   A hysteresis thresholding method can be used to make the state estimation more robust to noise (a short sketch follows this list).
        -   Event/State Classification model
            -   To classify the events/states, other models are trained.
        -   Object state classification
            -   The physical motion model is used to determine if the object is moving consistent with gravity (flying), no gravity (on the ground), or if its movement is not explainable by gravity alone (manipulated, external influence).
        -   Location of interaction detection
            -   Motion models are estimated for person body parts and relevant objects (e.g. a ball) that can determine, given trajectories for each participating object, where on the object the interaction occurred.
            -   Direct indicators might be incoming vs. outgoing velocity vector, incoming vs. outgoing object spin, deformation of objects, amount of force applied, reaction of the person, etc.
    -   Combination of physical motion model and learned model
        -   Motion model + estimation of the intersection point -> see System for interactive physical exercises using mobile phones - Shot on Goal.
        -   The motion model and learned model can be combined by e.g. using the motion model output as an additional input to the learned model.
    -   Robustness improvements to physical motion model
        -   For this, ground-truth physical trajectories of objects can be recorded together with the noisy measurements that will be available at runtime of the system (e.g. using multiple cameras).
        -   A learned model can be trained to map noisy trajectories to ground-truth trajectories of these objects.
        -   The model can be used to de-noise/interpolate data fed into the motion model.
    -   Combination of augmented reality and physical motion model or learned model
        -   Augmented reality methods are used to build scene understanding of the environment in which the person's actions need to be analyzed.
        -   The system may use plane and/or ground plane detection, 3D scene reconstruction, depth estimation, etc.
        -   Based on the 3D location of the object at an event, this can be used to determine if an interaction was between the objects, or between object and environment (e.g. bounce of the ball on the ground plane).
    -   High efficiency on low end hardware [Computational Resource Management]
        -   Volume of Interest
            -   Using domain information and previous detections of objects, as well as scene understanding, the system selects volumes of interest.
            -   A volume of interest consists of a 2D region of interest to crop the image and a depth from which a scale range is determined, to indicate to follow-up components how to analyze the data.
            -   This information is then passed on to follow-up components (e.g. object, pose and action recognition) to focus use of computational resources.
            -   As a specific example, the human pose estimation could only be executed for a region and scale in the image where the ball is expected to be.
        -   Temporally adaptive computation
            -   Using a performance profile of the device, a trade-off problem between number of computations per component and quality of the predictions is formulated.
            -   A system can be created that adaptively maximizes prediction accuracy given the device's battery, heat and computation constraints (e.g. by training an agent using reinforcement learning).
            -   It can be a part of the learned detection/classification model.
            -   The system decides, based on previous data, whether new incoming sensor data for a new timestamp should be processed by a certain component.
            -   The adaptively sparsified processed data is then used as input by the detection.
            -   An alternative method can be to fix the frequency of certain components based on empirical benchmarking.
            -   For example, to accurately classify events, the system could decide to only process the person's pose if a change in the ball's trajectory has been detected, thus freeing up resources from pose estimation that would otherwise run continuously.
    -   Human pose dynamics model
        -   Based on the skeletal model and physically measured and/or simulated properties, a model of a human's pose over time can be estimated.
        -   This model will be updated with measurements and is used to rate the probability of certain events/states and classifications given the current state of the model.
        -   One of the properties of the model can be a weight distribution and/or tension map throughout the skeleton that represents movement potential for the different body parts.
        -   Using this or other methods the standing leg can be determined.
        -   The event classification can be augmented by information on this model (e.g. the standing leg, suggesting that if a kick was detected, the non-standing leg was involved).
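By way of non-limiting illustration of the hysteresis thresholding mentioned above for state detection, the following minimal sketch (in Python; the threshold values and the per-frame state scores are illustrative assumptions, not values prescribed by this disclosure) switches a state on only when the score rises above a high threshold and off only when it falls below a low threshold, thereby suppressing flicker caused by noisy scores.

```python
# Minimal sketch of hysteresis thresholding over per-frame state scores.
# Thresholds and the example scores are illustrative assumptions.
def hysteresis_states(scores, on_threshold=0.7, off_threshold=0.3):
    """Return one boolean state per timestep; the state turns on only when the
    score exceeds on_threshold and turns off only when it drops below
    off_threshold, so brief noisy dips/spikes do not toggle the state."""
    states = []
    active = False
    for s in scores:
        if not active and s >= on_threshold:
            active = True
        elif active and s <= off_threshold:
            active = False
        states.append(active)
    return states

# Example: a noisy "holding the ball" score over time.
print(hysteresis_states([0.1, 0.5, 0.8, 0.65, 0.4, 0.75, 0.2, 0.1]))
```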

Optional Extensions:

-   -   Physical model to determine energy transfer
        -   The human pose dynamics model can be extended to model energy transfer.
        -   The movement potential of the model can be used to determine the likelihood of a specific effect (resulting trajectory of the object given the interaction with a body part) to detect and filter out events.

Operation 50

Transfer of location information.

In every interaction, the absolute 3D location of the object, and the absolute 3D location of the articulated object's joint, become equal.

This, combined with relative 3D locations of joints, can be used to compute the common/shared/equal absolute 3D location of the secondary object and of the portion of the articulated object that is interacting with the secondary object.

This results in an estimate of the kinematic chain lengths, e.g. by estimating the distance between absolute joint locations.

The estimate of the kinematic chain lengths may be used in operations 72 and/or 77.

Operation 60

Using results of the object tracker or re-identification operation 20, and/or by collecting all the estimates of operation 50, estimate the object's size over time.

Definition:

Tree aka kinematic tree: Modelling the human body as a tree of rigid links is known, where the rigid links include the torso and various limb portions such as the upper arm till the elbow, the lower leg from the knee down, etc. An example tree is illustrated pictorially in FIG. 4.

Typically, at least one tree, or each tree, comprises only a subset of all possible joint-to-joint connections. Typically, only those joint-to-joint connections or joint pairs which are distance invariant, e.g. always stay at the same distance from one another (e.g. joints on either end of a single bone, such as the elbow and wrist joints which are on either end of the lower arm bone), are included in the subset.

Example: a kinematic tree may include only the following 4 joint pairs:

Left shoulder to left elbow
Right shoulder to right elbow
Left elbow to left wrist
Right elbow to right wrist

The tree may be stored in digital memory as a U, V index mapping e.g. as described below.

The method of operation 60 uses a pair of mappings, where each of the 2 mappings in each pair maps a connection or joint (typically represented as an index array U[i]) to j, where j is a specific key point from among n key-points.

Lower case i is an index used to denote a given connection i in a kinematic tree modeling a given human body. 1<=i<=m, with m as the number of connections in the body/tree.

For example, P ∈ R^(6×3) (where ^ denotes the exponent/superscript) holds six absolute 3D locations, of the left shoulder (1), right shoulder (2), left elbow (3), right elbow (4), left wrist (5) and right wrist (6) respectively. Thus P typically comprises a matrix of 6 rows and 3 columns, storing absolute 3D locations per row, with

U=[1, 2, 3, 4] V=[3, 4, 5, 6]

As above, the kinematic tree consists of only the following joint pairs or kinematic pairs (aka "pairs of mappings"):

Left shoulder (1) to left elbow (3)
Right shoulder (2) to right elbow (4)
Left elbow (3) to left wrist (5)
Right elbow (4) to right wrist (6).

Thus the U index array stores indices of one joint, aka the "starting joint", of the 4 joint pairs in the tree, whereas the V index array stores indices of the other end, aka the "end joint", of the 4 joint pairs in the tree.

Typically, for index i running from 1 to 4, a joint pair (U[i], V[i]) is a kinematic pair; the two joints are physically connected, and their distance stays the same, irrespective of how the joints are moving in 3D.

The absolute locations P of an articulated object typically each comprise a set of 3D vectors (x, y, z). This set typically represents the absolute 3D location of all or at least some of the joints of the articulated object.

The person size typically is represented as a collection of joint-to-joint lengths which typically includes only a subset of joint-to-joint lengths, where the lengths in the subset are typically only those for which distances do not change. For example, a measurement of left wrist to left elbow may be included, as these joints are connected physically, hence distances between them do not change, whereas the distance of left wrist to right wrist is not included in this collection, as the distance between wrists, which are not connected physically, can change arbitrarily.

s (real world or physical size, e.g. in meters) is a 1×m vector, typically in R^m (where ^ again denotes a superscript), where m is the number of kinematic pairs.

j as above is an index of a specific 3D absolute joint location (aka "key point"). Its value is between 1 and n, where n is the number of 3D absolute joint locations of a person.

The person size vector s is defined, entry by entry, as the Euclidean distance between the absolute locations indexed by the index arrays U[i] and V[i], for a given connection i in a kinematic tree modeling a given human body. It is appreciated that U is an array of start indices, whereas V is an array of end indices.

Typically, the indication of start/end may be arbitrarily pre-defined, e.g. the elbow may be predefined as the "start" of the lower arm, or in another embodiment as the "end" of the lower arm.

Thus the person size vector may be computed as the Euclidean distance between P[U[i]] and P[V[i]] where, as above, P denotes the absolute locations of an articulated object, and P typically comprises a set of 3D vectors (x, y, z). This set typically represents the absolute 3D location of all or at least some of the joints of the articulated object.
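To make the above indexing concrete, the following minimal sketch (Python/NumPy; the joint coordinates are illustrative, and 0-based indices are used instead of the 1-based indices in the text) computes the person size vector s as the Euclidean distance between P[U[i]] and P[V[i]] for each kinematic pair.

```python
import numpy as np

# Minimal sketch of the person-size vector s: Euclidean distances between the
# absolute 3D locations P[U[i]] and P[V[i]] of each kinematic pair.
# Coordinates below are illustrative only.
P = np.array([
    [0.0, 0.0, 0.0],    # (1) left shoulder
    [0.4, 0.0, 0.0],    # (2) right shoulder
    [0.0, 0.0, 0.32],   # (3) left elbow
    [0.4, 0.0, 0.32],   # (4) right elbow
    [0.0, 0.34, 0.32],  # (5) left wrist
    [0.4, 0.34, 0.32],  # (6) right wrist
])
U = np.array([1, 2, 3, 4]) - 1  # start joints of the 4 kinematic pairs (0-based)
V = np.array([3, 4, 5, 6]) - 1  # end joints of the 4 kinematic pairs (0-based)

s = np.linalg.norm(P[U] - P[V], axis=1)  # one length (in meters) per pair
print(s)  # [0.32 0.32 0.34 0.34]
```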

The above operation, operation 60, is performed for each frame in a given video sequence. For each frame, all estimates above are collected.

Thus, when operation 60 is completed for one frame, the output includes (for each tracked person in the video sequence) at least one estimate of the real-life size of that person. Typically, plural estimates of a person's size are collected, e.g. one estimate for each time that this person (say, Rachel) interacts with an object of known size, e.g. with a ball of known size, or with a person (say, Joseph) whose size is known, having been derived previously from Joseph's handling of the ball of known size.

Typically, each estimate of a person's size includes plural, e.g. m, bone length measurements, e.g. in meters.

Typically, all persons in the video sequence are tracked, and person size estimates are accumulated for each, where each estimate of a person's size typically includes measurements of certain of the person's bone lengths, e.g. in meters.

Typically, when operation 60 is completed for all frames in the video sequence, the output of operation 60 includes, for each tracked person in the video sequence, a plurality of estimations of that person's size.

At this stage, operations 72, 77 should be used to find a "best" person size estimate, from among all person size estimates collected in operation 60, over time, i.e. over all frames in the video sequence. Typically, each person size estimate includes measurements of certain of the person's bone lengths, e.g. in meters.

Operation 72: Estimate a "best" average

Select a "best" measurement for each bone, e.g. a median of all available bone measurements (e.g. if plural bone measurements are available, having been derived from respective plural interactions of the relevant joint/s with object/s of known size). Optionally, outliers may be removed.

Example: Bone lengths are typically a distance between the joints on either end of the bone, e.g. shoulder to elbow (upper arm bone), or elbow to wrist (lower arm bone).

In this example, two bone lengths are measured: shoulder to elbow, and elbow to wrist. 7 estimates are available because the person touched the ball of known size 7 times. The 7 measurements of each of the 2 bones are sorted by size:

[0.1 m, 0.3 m, 0.31 m, 0.32 m, 0.35 m, 1.0 m, 1.2 m] for shoulder to elbow (upper arm); and

[0.2 m, 0.29 m, 0.3 m, 0.34 m, 0.34 m, 0.9 m, 1.4 m] for elbow to wrist (lower arm).

Taking the median would result in 0.32 m for shoulder to elbow and 0.34 m for elbow to wrist.

A "best" measurement for each bone may optionally be defined as an average bone measurement, e.g. over plural available measurements of that bone, where the average is typically computed disregarding outliers, and where outliers are defined as bone measurements whose distance from the closest other measurements is large, compared to the distances between other measurement pairs.
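A minimal sketch of operation 72, using the example measurements above (Python/NumPy; the optional distance-from-median outlier rule is an illustrative choice, since the text leaves the exact outlier criterion open):

```python
import numpy as np

# Minimal sketch of operation 72: take the median of the available measurements
# for each bone, as in the example above. Outlier removal before the median is
# optional; a simple distance-from-median rule is shown as one possibility.
def best_bone_length(measurements, max_dev=None):
    m = np.asarray(measurements, dtype=float)
    if max_dev is not None:
        med = np.median(m)
        m = m[np.abs(m - med) <= max_dev]   # drop clearly outlying measurements
    return float(np.median(m))

upper_arm = [0.1, 0.3, 0.31, 0.32, 0.35, 1.0, 1.2]   # shoulder to elbow, meters
lower_arm = [0.2, 0.29, 0.3, 0.34, 0.34, 0.9, 1.4]   # elbow to wrist, meters
print(best_bone_length(upper_arm), best_bone_length(lower_arm))  # 0.32 0.34
```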

Operation 77

The "best" bone length measurement computed in operation 72 is then used to scale (e.g. as described below with reference to operation 80) the relative 3D joint locations into absolute 3D joint locations.

Thus, person size, typically comprising a collection of plural lengths along the kinematic chain, is estimated by operations 72, 77. The estimates of kinematic chain lengths generated in operation 50 may be collected, e.g. stored digitally, and used in operation 77.

Operation 80

Compute, and provide an output indication of, the absolute 3D pose, where "pose" = a set of absolute 3D locations for each joint in the body of a tracked person, Sarah.

Typically,

$\text{Sarah's ABSOLUTE 3D POSE} = \frac{(\text{relative 3D joint locations}) \times (\text{mean over all estimated kinematic lengths, e.g. all bone lengths})}{(\text{mean over all kinematic lengths given by the relative 3D joint locations})}$

The above typically comprises a matrix of one row per joint, and 3 columns for the 3 coordinates. The relative 3D joint locations are also a matrix of the same dimensions, and typically the fraction yields a single scaling factor.

Typically,

$\text{absolute 3D location for Sarah's joint } J = \frac{(\text{relative 3D location of joint } J) \times (\text{mean over all estimated kinematic lengths, e.g. all bone lengths})}{(\text{mean over all kinematic lengths given by the relative 3D joint locations, e.g. as described in the example below})}$

The above equation may be used once for each of Sarah's joints.

The above computation is referred to herein as "simple scaling". An alternative computation using each measured bone length individually is now described by way of example. The example computation below typically yields a "pose" (absolute 3D joint locations).

Example Computation of a “Pose”

In Operation 72, assume by way of example that the following bone lengths have been computed: 0.32 m for shoulder to elbow and 0.34 m for elbow to wrist.

-   -   Now, the relative 3D pose estimator predicts the following        relative joint locations which may be provided by operation 10:

[0 m, 0 m, 0 m] shoulder

[0 m, 0 m, 0.2 m] elbow

[0 m, 0.212 m, 0.2 m] wrist.

Since it is known that the directions defined by the above relative joint locations are correct, one needs to select a kinematic root (any joint, e.g. the right shoulder) and start applying the correct directions starting from that kinematic root, e.g. setting

Absolute_shoulder_location=Relative_shoulder_location=[0 m, 0 m, 0 m]

Then, typically, iterate over all chains, e.g. shoulder->elbow, elbow->wrist.

-   -   The following two operations (elbow correction and wrist correction) are typically performed for each shoulder->elbow, elbow->wrist chain, hence for all such chains (e.g. all such chains for each person at the frame at which the person has touched an object of known size). The operation typically iterates over the person's chains, typically starting from the root, in any order.
    -   The above operation pertains to a moment at which the person has touched an object of known size. Thus, if a person (say) has touched a ball (say) plural times (perhaps 17 times), the above operation is typically executed 17 (say) times.

elbow correction aka Iteration 1: Correct the absolute elbow location e.g. by computing:

Absolute_elbow_location = absolute_shoulder_location [0 m, 0 m, 0 m] + normalized(relative_elbow_location [0 m, 0 m, 0.2 m] - relative_shoulder_location [0 m, 0 m, 0 m]) * (shoulder_to_elbow_length of 0.32 m) = [0 m, 0 m, 0.32 m],

Where a 3d vector x, y, z, may be normalized as follows:

normalized([x,y,z])=[x, y, z]/sqrt(x*x+y*y+z*z)

Wrist correction aka Iteration 2: Correct the absolute wrist location e.g. by computing:

Absolute_wrist_location = absolute_elbow_location [0 m, 0 m, 0.32 m] + normalized(relative_wrist_location [0 m, 0.212 m, 0.2 m] - relative_elbow_location [0 m, 0 m, 0.2 m]) * (elbow_to_wrist_length of 0.34 m) = [0 m, 0.34 m, 0.32 m]
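The elbow and wrist corrections above may be summarized, for the running example, by the following minimal sketch (Python/NumPy; the joint names, the choice of root and the chain order follow the example and are not intended to be limiting):

```python
import numpy as np

# Minimal sketch of the per-bone correction: starting from a chosen kinematic
# root, each relative bone direction is normalized and re-scaled to its
# measured length from operation 72.
def normalized(v):
    return v / np.linalg.norm(v)

rel = {  # relative 3D joint locations from the pose estimator (operation 10)
    "shoulder": np.array([0.0, 0.0, 0.0]),
    "elbow":    np.array([0.0, 0.0, 0.2]),
    "wrist":    np.array([0.0, 0.212, 0.2]),
}
bone_length = {("shoulder", "elbow"): 0.32, ("elbow", "wrist"): 0.34}  # meters

absolute = {"shoulder": rel["shoulder"]}          # the root is kept as-is
for parent, child in [("shoulder", "elbow"), ("elbow", "wrist")]:
    direction = normalized(rel[child] - rel[parent])
    absolute[child] = absolute[parent] + direction * bone_length[(parent, child)]

print(absolute)  # elbow -> [0, 0, 0.32], wrist -> [0, 0.34, 0.32]
```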

A translation is typically added to, or applied to, the above-computed absolute locations (e.g. the joint locations after correction thereof). This translation may be found using an iterative camera calibration process.

It is appreciated that the above computation of the absolute location may set the root joint at a location [0 m, 0 m, 0 m], wherein the person is scaled correctly; however, the person's position is not correct. Therefore, an optimization process is typically performed to find a 3D translation (e.g. in meters) which minimizes the re-projection error, or to identify a translation vector which aligns the projected 3D pose so as to best match the actual 2D pose in the image.

Optimization may be achieved using a suitable function minimization method (e.g. a "gradient descent" method) optimizing or minimizing the re-projection error (e.g. distance between 2D pixel locations and projected 2D locations) over the three parameters of the translation.

It is desired to minimize the re-projection error, which (e.g. as described in https://en.wikipedia.org/wiki/Reprojection_error) may be computed as the distance between the projected points (absolute 3D locations) and the measured joint pixel locations.

The 2D pixel locations may be 2D joint locations from a 2D pose estimation algorithm operative to provide a 2D pose estimation.

It is appreciated that some algorithms (e.g. that described herein with reference to operation 7) provide both a relative 3D pose estimation and a 2D pose estimation.

Or, alternatively, a relative 3D pose estimation may be provided by one algorithm and/or a 2D pose estimation may be provided by another algorithm.

The re-projected 2D locations may be computed by shifting the 3D joint locations by a translation vector before projection. Typically, the above may be achieved by using an optimization function, e.g. as described above, to find an optimal translation subject to minimizing the sum of (project(absolute 3D location + translation) - 2D location)², where "+" denotes shifting.
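By way of illustration only, the following minimal sketch finds such a translation numerically (Python/NumPy with scipy.optimize.minimize; the pinhole intrinsics f, cx, cy and the synthetic 2D measurements are assumptions made for the demonstration, not values prescribed herein):

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch: find the 3D translation t minimizing the re-projection error
# between root-relative absolute 3D joints shifted by t and measured 2D joints.
f, cx, cy = 1000.0, 640.0, 360.0            # illustrative camera intrinsics

def project(points_3d):
    """Pinhole projection of Nx3 camera-space points to Nx2 pixel coordinates."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

def reprojection_error(t, joints_3d, joints_2d):
    return np.sum((project(joints_3d + t) - joints_2d) ** 2)

joints_3d = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.32], [0.0, 0.34, 0.32]])
true_t = np.array([0.3, -0.1, 4.0])          # synthetic ground truth for the demo
joints_2d = project(joints_3d + true_t)      # pretend these are the measured pixels

result = minimize(reprojection_error, x0=np.array([0.0, 0.0, 3.0]),
                  args=(joints_3d, joints_2d))
print(result.x)  # converges to approximately [0.3, -0.1, 4.0]
```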

The global pose computation (3D localization) described on page 9 (supplementary section 3) of "Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision" by Dushyant Mehta et al, available online at the following http location: arxiv.org/pdf/1611.09813v5.pdf, describes a closed-form approximation of this translation vector, see e.g. formula (5) and the formulae 3.1 and 3.2 for x and y respectively, which appear in FIG. 2 herein.

Mehta et al describes inter alia the following, which refers to the equations and expressions of FIG. 3:

“Global Pose Computation 3.1. 3D localization”

A simple, yet very efficient, method is described to compute the global 3D location T of a noisy 3D point set P with unknown global position. Scaling and orientation parameters are assumed to be known, obtained from its 2D projection estimate K in a camera with known intrinsic parameters (focal length f). It is further assumed that the point cloud's spread in the depth direction is negligible compared to its distance z0 to the camera, and the perspective projection of an object near position (x0, y0, z0)^T is approximated with weak perspective projection (linearizing the pinhole projection model at z0), e.g. using Equation 1.

Estimates K and P are assumed to be noisy due to estimation errors. The optimal global position T in the least squares sense is found by minimizing T = arg min(x,y,z) E(x, y, z), with Equation 2, where P_i and K_i denote the i'th joint position in 3D and 2D, respectively, and P_i[xy] the xy component of P_i. It has partial derivatives as per Equation 3, where P[x] denotes the x part of P, and P̄ the mean of P over all joints. Setting Equation 3 to zero (e.g. solving ∂E/∂x=0) gives the unique closed-form solutions for x and y provided in Equations 3.1 and 3.2 respectively (the latter for ∂E/∂y=0).

Substitution of x and y in E and differentiating with respect to z yields Equation 4.

Finally, solving ∂E/∂z=0 gives the depth estimate z of Equation 5, where expression 6 is approximated for θ≈0. This is a valid assumption in this case, since the rotation of the 3D and 2D pose is assumed to be matching.

Typically, in the above method, z is first solved, and x and y are then computed.

A simpler method, which simply applies a scalar for conversion and may replace the above method (the simpler method below being particularly useful e.g. if body proportions may be assumed to be uniform over different persons), is now described:

$\text{ABSOLUTE 3D POSE} = \frac{(\text{relative 3D joint locations}) \times (\text{mean over all estimated kinematic lengths, i.e. all estimated bone lengths})}{(\text{mean over all kinematic lengths given by the relative 3D joint locations})}$

This yields a mean of absolute bone lengths (0.34 m+0.32 m)/2=0.33 m and a mean of relative bone lengths (0.2 m+0.212 m)/2=0.206 m.

It is appreciated that the above computation (performed for just two connections of a kinematic chain by way of example) is performed for all connections along the kinematic chain, resulting in a scale factor of 1.601, computed by dividing 0.33 m [mean absolute bone length] by 0.206 m [mean relative bone length].

Multiplying this scale factor by each of:

-   -   [0 m, 0 m, 0 m] shoulder
    -   [0 m, 0 m, 0.2 m] elbow
    -   [0 m, 0.212 m, 0.2 m] wrist

    results in the following "pose" (absolute 3D locations for the following 3 joints respectively):

    -   [0 m, 0 m, 0 m] shoulder (because the right shoulder is the selected kinematic root)
    -   [0 m, 0 m, 0.32 m] elbow
    -   [0 m, 0.34 m, 0.32 m] wrist
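The "simple scaling" alternative may be summarized, for the same example, by the following minimal sketch (Python/NumPy; the bone-length values are those of the running example):

```python
import numpy as np

# Minimal sketch of "simple scaling": one global scale factor, the ratio of the
# mean measured bone length to the mean relative bone length, applied to all
# relative joint locations.
rel_joints = np.array([[0.0, 0.0, 0.0],      # shoulder (kinematic root)
                       [0.0, 0.0, 0.2],      # elbow
                       [0.0, 0.212, 0.2]])   # wrist
rel_bone_lengths = [0.2, 0.212]              # lengths implied by the relative pose
abs_bone_lengths = [0.32, 0.34]              # "best" estimates from operation 72

scale = np.mean(abs_bone_lengths) / np.mean(rel_bone_lengths)   # ~1.601
abs_joints = rel_joints * scale
print(scale, abs_joints)   # elbow ~[0, 0, 0.32], wrist ~[0, 0.34, 0.32]
```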

It is appreciated that it may be desired to compute the absolute location of an object, e.g. a person, who never once, throughout the video sequence, interacts with the/an object y of known size. However, typically, the method herein may be used to determine the absolute location and/or size of any object x′ which does interact, at some time within the video sequence, with the object y of known size. Then, those locations may be used to determine the absolute location and/or size of any object x″ which does interact, at some time within the video sequence, with any object x′. Then, the absolute locations of object/s x″ may be used to determine absolute locations and/or sizes of any object x′″ which does interact, at any time within the video sequence, with any object x″, and so forth, until absolute locations of even those objects which never once, throughout the video sequence, interact with the/an object y of known size, become known.

It is appreciated that all or any subset of operations 5-80 may be performed on a single image depicting interacting objects. However, if the operations are performed on a video sequence of images, this improves accuracy by gathering multiple estimates over time, and then combining these estimates in any suitable manner, e.g. by averaging, either over all estimates or over whichever estimates remain after outlying estimates are disregarded. Using video data, the sequential absolute positions of an object x whose relative locations to other objects in the video sequence are known may be computed, by detecting interactions of object x with any object y whose absolute locations are known (e.g. having previously been computed using the method shown and described herein).

Operation 90 (example use-cases)

Use the absolute 3D pose output (which typically includes absolute 3D locations for each (or at least one) joint in each tracked person's body) generated in operation 80, for use cases such as but not limited to estimating positions of persons in a video (e.g. basketball players in a broadcast video) or detecting speed of a person in a video sequence (e.g. a person who is handling a ball). A particular advantage is that the method corrects for person size, thereby correctly handling both children and adults, rather than, for example, confounding children at a distance x from the camera with adults at a distance X>x from the camera.

Many different uses may be made of the methods shown and described herein. For example, in some embodiments, e.g. for consumer analytics, the method herein allows an adult who has approached the window (hence is engaged with the displayed wares) to be distinguished from a child who is far from (hence not engaged with) the displayed wares; or, more generally, the method allows the distance between consumer and wares to be computed without that distance being confounded with the size of the consumer.

Or, in some embodiments, movement speeds of different body joints may be computed, e.g. to "diagnose" an athlete's kicking speed or punching speed, based on the absolute 3D location of her or his wrists and ankles as computed using the methods herein. Kicking or punching speed in turn allows other athlete characteristics to be identified, e.g. whether the athlete or fighter is more of a kicker or a puncher, or how dominant the fighter's left/right side is. Any suitable method may be employed to compute the speed of an athlete's limbs, based on the athlete's absolute 3D pose.

Joint speed may be computed conventionally, by computing distances between a joint's (typically absolute) 3D locations over time and dividing by the time period separating sequential locations.

A person's speed may be computed by computing the average speed of all or some of her or his joints. Note that selection of which joints are employed depends on the use case. For example, if a person is walking, s/he tends to swing her or his arms back and forth; therefore, when computing walking speed, the following joint speeds may for example be averaged: hips, shoulders, face, whereas the speeds of the elbows and wrists would typically not be used.
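A minimal sketch of such a speed computation (Python/NumPy; the frame rate, the joint layout and the choice of joints to average are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: per-joint speed from consecutive absolute 3D locations, and a
# person speed as the average over a use-case-dependent subset of joints.
def joint_speeds(locations, fps):
    """locations: TxJx3 absolute joint positions in meters; returns (T-1)xJ speeds in m/s."""
    displacements = np.linalg.norm(np.diff(locations, axis=0), axis=2)
    return displacements * fps

def person_speed(locations, fps, joint_indices):
    return joint_speeds(locations, fps)[:, joint_indices].mean(axis=1)

# Example: 3 frames at 30 fps, 2 joints, each moving 0.05 m per frame along x.
loc = np.array([[[0.00, 0, 0], [0.00, 0, 1]],
                [[0.05, 0, 0], [0.05, 0, 1]],
                [[0.10, 0, 0], [0.10, 0, 1]]])
print(person_speed(loc, fps=30, joint_indices=[0, 1]))  # ~1.5 m/s per step
```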

Individual speeds of individual joints may be used, for example, to anticipate, based on the fact that a person has moved her or his shoulders forward, that the person is about to move, even though her or his feet still have a speed of zero.

Basketball player positions may be used as an input to other analytics detecting interesting moments, for example tactical situations where there is an increased possibility of a goal, e.g. because players are approaching the goalpost. This is indicative of upcoming possible 'exciting' moments in a game, and thus, the basketball player positions may be used to notify users, e.g. via apps, that something interesting has happened or is about to happen. Determining exciting moments may also involve deducing a basketball (say) player's intent, e.g. analyzing if s/he is moving fast towards the goal, which may be indicative of an intent to attack which may be deemed an exciting moment, whereas movement toward the goalpost at a slow speed may be indicative of the player's waiting for a particular moment to move towards the other team's position, which may not be deemed an exciting moment.

Other use cases may include all or any subset of the following:

In consumer mobile apps, knowing the person location can give rise to analytics deriving the 3D motion of the person, and leveraging analytics where the person interacts with other objects such as a soccer ball. This can be used e.g. for detecting juggling, joggling/tricks with a ball in a single- to multi-player setup.

For health care, 3D absolute positions can be used to distinguish right posture from wrong, or abnormal gait from normal gait.

Customers' position and size may be used to determine if they are engaged with the shopping window display. Here, having an accurate size and position can distinguish between small persons and persons further away. Alternatively or in addition, given the person's size, analytics on customer demographics, such as categorizing customers as being children vs adults, can be created.

An automated driving system or driving assistant can use the person position and person size to detect possible impacts, provide collision avoidance, and warn of dangerous situations.

According to certain embodiments, athlete positions in upcoming frames are predicted, and virtual ads are positioned accordingly, so as not to obscure these players, by placing the virtual ads in (absolute or relative) locations other than the predicted player positions in upcoming frames.

According to certain embodiments, a system for accurate 3D pose estimation is provided which includes logic configured for using a 3D pose estimation from image and/or other 3D localization techniques, to determine e.g. 3D positions based on interactions with object/s, e.g. balls having known size (and/or ground floor estimation), e.g. using sensors on a phone, and/or an action recognition framework, as described herein. An action recognition framework may be used to detect interactions between objects, e.g. to recognize an interaction of a joint of an articulated object, e.g. a person, with an object of known size.

For example, a phone's accelerometer may yield an acceleration vector pointing downwards towards the center of the Earth. Therefore, a floor plane is perpendicular to this vector. The ground plane may be parameterized based on a perpendicular vector and the distance of the ground plane from the coordinate system's origin. The origin may be arbitrarily defined; for example, the origin may be defined as the camera's position in the first frame, or the point on the floor that the camera is facing in the first frame. The accelerometer vector may be used as a constraint in fitting the ground plane; e.g., as in the two references cited below, the ground plane may be estimated by multiple sensors, where each sensor adds a constraint or information that can be used in finding the best fit.
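By way of illustration, a minimal sketch of such a ground-plane parameterization from an accelerometer reading is shown below (Python/NumPy; the raw accelerometer values, the assumed camera height above the floor, and the choice of origin are all illustrative assumptions, not values prescribed by this disclosure):

```python
import numpy as np

# Minimal sketch: the accelerometer's gravity vector (pointing down) gives the
# floor normal (pointing up); the plane is stored as (unit normal n, offset d)
# so that points p on the floor satisfy dot(n, p) + d = 0.
gravity = np.array([0.1, -9.7, 0.4])          # raw accelerometer reading, m/s^2
n = -gravity / np.linalg.norm(gravity)        # unit "up" normal of the floor

camera_height = 1.4                           # assumed height of the origin (camera) above the floor
d = camera_height                             # origin at the camera => dot(n, p) = -d on the floor

def height_above_floor(p):
    """Signed height (meters) of a 3D point p, in the origin's frame, above the ground plane."""
    return float(np.dot(n, p) + d)

print(height_above_floor(np.array([0.0, -1.4, 0.0])))  # ~0: a point roughly on the floor
```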

It is appreciated that any suitable method may be employed for fusing sensor data with visual data to estimate the orientation of a ground plane, such as but not limited to:

https://ap.isr.uc.pt/wp-content/uploads/mrl_data/archive/124.pdf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5676736/

Notably, in the latter (nih) reference, accelerometer, gyroscope and visual data are all fused together and a Kalman filter takes care of all the sensor noise, e.g. as described in:

https://www.instructables.com/id/Accelerometer-Gyro-Tutorial/

Typically, the accelerometer is not solely relied upon, since as a sensor it may be noisy. Instead, multiple phone sensor measurements, typically accumulated over time, may be used to correct, say, ground plane orientation, using, say, rotations estimated by the phone's gyroscope, e.g. as described in

https://ap.isr.uc.pt/wp-content/uploads/mrl_data/archive/124.pdf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5676736/.

According to certain embodiments, a phone camera images the objects whose size and/or position is to be estimated, and that phone's sensor/s, e.g. accelerometer, is/are used e.g. as above.

According to certain embodiments, the method herein is performed as a streaming method, in which frames are processed one after the other. Typically, if no knowledge about size is available (e.g. if A interacts with B but B's size is not known), no knowledge is passed, and no person size estimate is computed.

According to another possible embodiment, data regarding an interaction between A and B in frame f may be stored even if neither A's size nor B's is known, since the size of A (say) may be derivable from frame/s later than f, which would then enable the size of B to be derived retroactively from the size of A, known from the later frame/s, in conjunction with the stored data regarding the interaction between A and B which occurred in frame f.

It is appreciated that, typically, for at least one frame, or all frames, in which a person (say) is touching a ball (say) whose location and/or size are known, the absolute 3D locations of the joint touching or closest to the ball, and of the ball itself, are assumed to be equal. Given this assumption, the relationship between the absolute 3D location of the ball and the 2D joint location may be determined e.g. using a suitable camera projection equation or system of equations, e.g. that presented herein.

It is appreciated that fixed sizes characterizing an object o1, e.g. a person (e.g. the actual distance between the person's right wrist and elbow, which is known to remain fixed between frames, as opposed to the distance between that person's right and left wrists, which does not remain fixed), may be measured, for use, e.g. as described herein, in converting relative 3D joint locations belonging to that object o1 into absolute 3D joint locations for object o1, in frames which may even include frames in which the object o1 is not interacting with, or is not in direct contact with, an object o2 having a 3D absolute location and/or size which are known. Alternatively or in addition, measuring such distances is useful for estimating o1's size or body proportions.

A particular advantage of certain methods herein is that averaging is performed on measured person sizes, e.g. measured bone lengths, rather than averaging over 3D locations, which would undesirably average out any motions of the bones.

It is appreciated that terminology such as "mandatory", "required", "need" and "must" refer to implementation choices made within the context of a particular implementation or application described herewithin for clarity, and are not intended to be limiting, since, in an alternative implementation, the same elements might be defined as not mandatory, and not required, or might even be eliminated altogether.

Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques, and vice-versa. Each module or component or processor may be centralized in a single physical location or physical device, or distributed over several physical locations or physical devices.

Included in the scope of the present disclosure, inter alia, are electromagnetic signals in accordance with the description herein. These may carry computer-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order, including simultaneous performance of suitable groups of operations as appropriate. Included in the scope of the present disclosure, inter alia, are machine-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the operations of any of the methods shown and described herein, in any suitable order, i.e. not necessarily as shown, including performing various operations in parallel or concurrently rather than sequentially as shown; a computer program product comprising a computer useable medium having computer readable program code, such as executable code, embodied therein, and/or including computer readable program code for performing any or all of the operations of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the operations of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the operations of any of the methods shown and described herein, in any suitable order; electronic devices each including at least one processor and/or cooperating input device and/or output device and operative to perform, e.g. in software, any operations shown and described herein; information storage devices or physical records, such as disks or hard drives, causing at least one computer or other device to be configured so as to carry out any or all of the operations of any of the methods shown and described herein, in any suitable order; at least one program pre-stored e.g. in memory or on an information network such as the Internet, before or after being downloaded, which embodies any or all of the operations of any of the methods shown and described herein, in any suitable order, and the method of uploading or downloading such, and a system including server/s and/or client/s for using such; at least one processor configured to perform any combination of the described operations or to execute any combination of the described modules; and hardware which performs any or all of the operations of any of the methods shown and described herein, in any suitable order, either alone or in conjunction with software. Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.

Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any operation or functionality described herein may be wholly or partially computer-implemented, e.g. by one or more processors. The invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally including at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objectives described herein; and (b) outputting the solution.

The system may, if desired, be implemented as a web-based system employing software, computers, routers and telecommunications equipment, as appropriate.

Any suitable deployment may be employed to provide functionalities, e.g. software functionalities shown and described herein. For example, a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse. Any or all functionalities, e.g. software functionalities shown and described herein, may be deployed in a cloud environment. Clients, e.g. mobile communication devices such as smartphones, may be operatively associated with, but external to, the cloud.

The scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are, if they so desire, able to modify the device to obtain the structure or function.

Any "if-then" logic described herein is intended to include embodiments in which a processor is programmed to repeatedly determine whether condition x, which is sometimes true and sometimes false, is currently true or false, and to perform y each time x is determined to be true, thereby to yield a processor which performs y at least once, typically on an "if and only if" basis, e.g. triggered only by determinations that x is true and never by determinations that x is false.

Any determination of a state or condition described herein, and/or other data generated herein, may be harnessed for any suitable technical effect. For example, the determination may be transmitted or fed to any suitable hardware, firmware or software module, which is known, or which is described herein, to have capabilities to perform a technical operation responsive to the state or condition. The technical operation may for example comprise changing the state or condition, or may more generally cause any outcome which is technically advantageous given the state or condition or data, and/or may prevent at least one outcome which is disadvantageous given the state or condition or data. Alternatively, or in addition, an alert may be provided to an appropriate human operator, or to an appropriate external system.

Features of the present invention, including operations which are described in the context of separate embodiments, may also be provided in combination in a single embodiment. For example, a system embodiment is intended to include a corresponding process embodiment and vice versa. Also, each system embodiment is intended to include a server-centered "view" or client-centered "view", or a "view" from any other node of the system, of the entire functionality of the system, computer-readable medium, apparatus, including only those functionalities performed at that server or client or node. Features may also be combined with features known in the art and particularly, although not limited to, those described in the Background section or in publications mentioned therein.

Conversely, features of the invention, including operations, which are described for brevity in the context of a single embodiment or in a certain order, may be provided separately or in any suitable subcombination, including with features known in the art (particularly although not limited to those described in the Background section or in publications mentioned therein) or in a different order. "e.g." is used herein in the sense of a specific example which is not intended to be limiting. Each method may comprise all or any subset of the operations illustrated or described, suitably ordered, e.g. as illustrated or described herein.

Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments, or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless HomePNA, power line communication, cell phone, Smart Phone (e.g. iPhone), Tablet, Laptop, PDA, Blackberry GPRS, Satellite including GPS, or other mobile delivery. It is appreciated that in the description and drawings shown and described herein, functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and operations therewithin, and functionalities described or illustrated as methods and operations therewithin can also be provided as systems and sub-units thereof. The scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

Any suitable communication may be employed between separate units herein, e.g. wired data communication and/or short-range radio communication with sensors such as cameras, e.g. via WiFi, Bluetooth or Zigbee.

It is appreciated that implementation via a cellular app as described herein is but an example, and, instead, embodiments of the present invention may be implemented, say, as a smartphone SDK; as a hardware component; as an STK application; or as suitable combinations of any of the above.

Any processing functionality illustrated (or described herein) may be executed by any device having a processor, such as but not limited to a mobile telephone, set-top-box, TV, remote desktop computer, game console, tablet, mobile e.g. laptop or other computer terminal, or embedded remote unit, which may either be networked itself (e.g. may itself be a node in a conventional communication network) or may be conventionally tethered to a networked device (to a device which is a node in a conventional communication network or is tethered directly or indirectly/ultimately to such a node).

1. A system for estimating an absolute 3D location of at least one object x imaged by a single camera, the system including: processing circuitry configured for identifying an interaction, at time t, of object x with an object y imaged with said object x by said single camera, including logic for determining object y's absolute 3D location at time t, and providing an output indication of object x's absolute location at time t, derived from the 3D location, as known, at time t, of object y.

2. A system according to claim 1 wherein said object x comprises a first joint interconnected via a limb of fixed length to a second joint and wherein the system determines said fixed length and then determines said second joint's absolute 3D location at time t to be a location L whose distance from object x's absolute location at time t is said fixed length.

3. A method for providing an output indication of absolute 3D locations of objects imaged by a single camera, the method including using a hardware processor for performing, at least once, the following operations: a. providing relative 3D joint locations for an object x having joints J whose absolute locations are of interest; b. providing absolute 3D locations of at least one object y, which is imaged by said single camera and whose size is known; c. re-identifying objects x, y over time; d. recognizing interaction of object x as re-identified in operation c, with said object y of known size as re-identified in operation c; e. for each interaction recognized in operation d, finding absolute 3D location L of object y with known size, at time T at which interaction occurred; setting absolute 3D location of joint J to be L; using relative 3D locations, as provided in operation a, of at least one joint k other than J relative to the 3D location of joint J, to compute absolute 3D locations of joint k, given the absolute location L of J as found; and determining distances between absolute 3D locations of various joints k, to yield an object size parameter comprising an estimated length of a limb portion extending between joints J and k; f. computing a best-estimate from the object sizes estimated in operation e, for each tracked limb portion; g. using said best-estimate to scale relative 3D joint locations into absolute 3D joint locations; and h. generating an output indication of said absolute 3D joint locations.

4. A method according to claim 3 wherein said object y is a stationary object having a permanent absolute 3D location which is stored in memory.

5. A method according to claim 3 wherein said object y comprises a ball and wherein data is pre-stored regarding the ball's known conventional size.

6. A method according to claim 3 wherein in operation c, said re-identifying objects comprises tracking said objects.

7. A method according to claim 3 wherein in operation f, said computing of a best-estimate includes removing outliers which differ from a cluster of estimates to a greater extent, relative to the extent to which the estimates in the cluster differ among themselves.

8. A method according to claim 3 wherein said method is performed in near-real time.

9. A method according to claim 3 wherein in operation h, said output indication serves as an input to a processor which identifies interesting moments in the single camera's output and alerts end-users of said interesting moments.

10. A method according to claim 3 wherein in operation h, said output indication serves as an input to a processor which controls insertions of virtual advertisements.

11. A method according to claim 3 wherein said object y comprises an object O whose absolute 3D joint locations are known from operation h by virtue of having previously derived absolute 3D locations of object O's joints by performing said operations a-h.

12. A method according to claim 3 wherein said output indication may be sensed by a human user.

13. A system according to claim 1 wherein said determining object y's absolute 3D location comprises retrieving object y's absolute 3D location from memory, because object y's absolute 3D location is known to the system.

14. A system according to claim 1 wherein said processing circuitry is also configured for: generating plural estimates of object x's size for plural interactions respectively; finding a best estimate from among said estimates, over time; and using said best estimate to estimate the absolute 3D location, thereby to handle situations in which object-joint distance is not exactly 0.

15. A system according to claim 1 wherein said object x's absolute location at time t is estimated as being equal to the 3D location, as known, at time t, of object y.

16. A system according to claim 1 wherein said object y's size is known and wherein said logic uses knowledge about said object y's size to derive said object x's absolute location at time t.

17. A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a method, said method comprising using a hardware processor for performing, at least once, all or any subset of the following operations: a. providing relative 3D joint locations for an object x having joints J whose absolute locations are of interest; b. providing absolute 3D locations of at least one object y, which is imaged by said single camera and whose size is known; c. re-identifying objects x, y over time; d. recognizing interaction of object x as re-identified in operation c, with said object y of known size as re-identified in operation c; e. for each interaction recognized in operation d, finding absolute 3D location L of object y with known size, at time T at which interaction occurred; setting absolute 3D location of joint J to be L; using relative 3D locations, as provided in operation a, of at least one joint k other than J relative to the 3D location of joint J, to compute absolute 3D locations of joint k, given the absolute location L of J as found; and determining distances between absolute 3D locations of various joints k, to yield an object size parameter comprising an estimated length of a limb portion extending between joints J and k; f. computing a best-estimate from the object sizes estimated in operation e, for each tracked limb portion; g. using said best-estimate to scale relative 3D joint locations into absolute 3D joint locations; and h. generating an output indication of said absolute 3D joint locations.

18. A system according to claim 1 wherein at least one fixed size, known to remain fixed between frames, which characterizes object x, is used for converting object x's relative 3D joint locations to absolute 3D joint locations for object x.

19. A system according to claim 18 wherein said fixed size comprises an actual distance between 2 connected joints belonging to object x.

20. A system according to claim 18 wherein said converting is performed for at least one frame in which object x is not interacting with object y.